cpu: add x86 AVX2/AVX512 SIMD fast paths for quantized MoE dot products#305
hexxyan wants to merge 1 commit into
I've tried it on my old EPYC 7552 with 8-channel 512 GB DDR4 2666: Sorry for the different lengths. I haven't found a way to stop this model from thinking. This branch does increase CPU performance and, most of all, using AVX2 or AVX512 is more efficient. But overall performance is still quite poor for a 153 GB model.

The previous test had a CPU so overkill that AVX didn't add much besides efficiency. AMD EPYC Embedded 3151 (4C/8T, Zen 1) on an ITX mainboard with 256 GB DDR4 2133 MHz in 2 channels: --> With this underpowered CPU, your branch results in 31% higher performance! In this case the measly 4 cores are a larger bottleneck than the memory bandwidth. So AVX2 helps a lot! 👍👍





Summary
The x86 CPU path currently falls back to scalar code for all quantized dot products (Q2_K, Q4_K, IQ2_XXS) used in routed MoE expert inference. This PR adds AVX2 and AVX512BW SIMD fast paths, and explicit Makefile targets for building ISA-specific binaries.
The existing ARM NEON, scalar, Metal, and CUDA paths are unchanged.
Why does this matter?
Each MoE layer runs dot products against multiple expert weight blocks (typically 8 selected experts × gate/up/down projections). On x86 these are currently pure scalar loops: one multiply-accumulate per value at a time. AVX2 operates on 256-bit vectors (8×float32 or 32×int8 lanes per instruction); AVX512 doubles that width. For CPU-only inference on large-RAM x86 machines, this should make routed expert evaluation substantially faster, directly improving tok/s.
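To make the scalar-versus-SIMD difference concrete, here is a minimal sketch of a signed int8 dot product with a scalar reference and an AVX2 variant. It is not the code from this PR: the function names `dot_i8_scalar` and `dot_i8_simd` are hypothetical, and the real kernels also handle block scales and the Q2_K/Q4_K/IQ2_XXS layouts, which are left out here.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Scalar reference: one multiply-accumulate per element. */
static int32_t dot_i8_scalar(const int8_t *a, const int8_t *b, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
}

/* Hypothetical SIMD variant: 16 int8 pairs per iteration when built with
 * AVX2, falling back to the scalar loop on any other target. */
static int32_t dot_i8_simd(const int8_t *a, const int8_t *b, size_t n) {
#if defined(__AVX2__)
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        /* Sign-extend 16 x int8 to 16 x int16. */
        __m256i va = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(a + i)));
        __m256i vb = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(b + i)));
        /* Multiply int16 pairs and add adjacent products into 8 x int32. */
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    /* Horizontal sum of the 8 int32 lanes. */
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x4E)); /* swap 64-bit halves */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0xB1)); /* swap adjacent lanes */
    int32_t sum = _mm_cvtsi128_si32(s);
    /* Scalar tail for lengths that are not a multiple of 16. */
    for (; i < n; i++) {
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
#else
    return dot_i8_scalar(a, b, n);
#endif
}
```

Because the path is chosen by the `__AVX2__` macro at compile time, the same source builds on non-x86 hosts and simply uses the scalar loop there. Both functions return identical integer results, so one can serve as the reference for the other.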
What changed?
New SIMD dot product kernels (compile-time selected via `__AVX2__` / `__AVX512F__` + `__AVX512BW__`):
- Q2_K × Q8_K
- Q4_K × Q8_K
- IQ2_XXS × Q8_K
- IQ2_XXS pair × Q8_K

Key implementation details:
- `madd_epi16` for signed dot product
- `maddubs_epi16` (unsigned × signed byte), AVX512 variant with `dpbusd_epi32` when VNNI is available
- `madd_epi16`
- (`ds4_zext_i128_to_i256`) for AVX512 `cvtepi8_epi16`: avoids undefined upper-lane garbage

New Makefile targets for explicit ISA-specific binaries:
Each SIMD variant uses separate `build/cpu-<suffix>/` object directories to avoid `.o` file conflicts. Architecture gating uses `$(filter x86_64 amd64,$(UNAME_M))`, which works across both Darwin and Linux. Non-x86 hosts get a clear error message.

New test file: `tests/test_quant_dot.c` replaces `tests/test_q2k_dot.c`, covering all three quant formats:

Validation status
Kernel correctness, tested via unit tests only:
- `make quant-dot-test`: 11/11 pass (scalar vs AVX2 cross-checked on 100 random blocks per format)
- `make cpu-avx2`: builds all five binaries
- `make ds4_cpu.o -B`: only pre-existing warnings
- `-mavx512f -mavx512bw` and `-mavx512vnni`

E2E model inference not tested: I do not have enough RAM to load a DeepSeek V4 model and run actual inference. The dot product kernels are verified correct in isolation, but real-world tok/s and output quality need validation on a large-RAM machine.
Community testing needed. If you have a large-RAM x86 machine (AVX2 or AVX512):
Useful reports:
- (`lscpu | grep -i avx`)
- `make quant-dot-test` passes

Notes
Test Plan
- `make quant-dot-test`: 11/11 pass (AVX2 cross-checked against scalar)
- `make cpu`: builds successfully, default behavior unchanged
- `make cpu-avx2`: builds `ds4-avx2` and siblings with fixed `-mavx2` (no `-march=native`)
- `git diff --check` clean